ABSTRACT

Sketch-based face recognition is an interesting task in vision
and multimedia research, yet it is quite challenging due to
the great difference between face photos and sketches. In
this paper, we propose a novel approach for photo-sketch
generation, aiming to automatically transform face photos
into detail-preserving personal sketches. Unlike traditional
models that synthesize sketches from a dictionary of
exemplars, we develop a fully convolutional network to learn
the end-to-end photo-sketch mapping. Our approach takes
whole face photos as inputs and directly generates the
corresponding sketch images with efficient inference and
learning; the architecture is a stack of convolutional
kernels of very small sizes only. To capture the person
identity well during the photo-sketch transformation, we define
our optimization objective in the form of a joint
generative-discriminative minimization. In particular, a discriminative
regularization term is incorporated into the photo-sketch
generation, enhancing the discriminability of the generated
person sketches against other individuals. Extensive
experiments on several standard benchmarks suggest that our
approach outperforms other state-of-the-art methods in both
photo-sketch generation and face sketch verification.

Categories and Subject Descriptors 

I.2.10 [ARTIFICIAL INTELLIGENCE]: Vision and Scene
Understanding

Keywords 
1. INTRODUCTION

Sketching is an important artistic drawing style and may be
the simplest form, since it is composed only of lines. An
interesting application is searching image databases using
free-hand sketch queries PUCE].

* Corresponding author is Liang Lin. This work was supported
by the Hi-Tech Research and Development Program of
China (no. 2013AA013801), Guangdong Natural Science Foundation
(no. S2013050014548), and the Hong Kong Scholar program.

Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than
ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from Permissions@acm.org.

ICMR’15, June 23-26, 2015, Shanghai, China.

Copyright © 2015 ACM 978-1-4503-3274-3/15/06 ...$15.00.

DOI: http://dx.doi.org/10.1145/2671188.2749321

However, drawing a vivid sketch portrait is time consuming
even for a skilled artist. Automatic face sketch generation
has been studied for a long time and has many useful
applications in digital entertainment m-

Another important application of face sketches is to
assist law enforcement. Suppose that we need to automatically
retrieve a photo from the database for a query image,
which helps the police to narrow down the suspects quickly.
Unfortunately, the photo of the suspect may be unavailable
in most cases. To deal with such a problem, the best substitute
available is an artist's drawing based on the recollection
of an eyewitness.

In this situation, we focus only on sketches without
exaggeration, so that the sketch realistically reflects the real
person. Figure 1 shows two samples of photo-sketch pairs in
the CUHK Face Sketch Database [24].



Figure 1: Samples of photo-sketch pairs: (a) photos; 
(b) sketches drawn by artist 

Figure 1 indicates the great difference between photos and
sketches. Thus, photo-based face verification methods cannot
be directly applied to this problem. The key to
sketch-based face verification is to reduce the modality difference
between photos and sketches.

One intuitive idea is to recover the photo image from a
sketch. However, this is an ill-posed problem because a sketch
may lose much information during the drawing procedure.
An alternative way is to generate a pseudo-sketch from a
photo, which has been discussed in Hamm]-

The original generation scheme tries to find a transformation
from photos to sketches. Intuitively, it should
be a complex nonlinear mapping. Thus, former works like
22 [I3j Ell EH E3 [23| '2(1 simplify the generation problem
to a synthesis problem. The underlying assumption is that, if two
photo images (or patches) are similar, their corresponding
sketch images (or patches) should also be similar. Thus, if
we find a way to use other photo images (or patches)
to synthesize a photo image (or patch), then we can use the
corresponding sketch images (or patches) to synthesize its
pseudo-sketch.

However, this simplification has some limitations,
especially in scalability. The time cost of synthesis-based
methods grows linearly with the amount of training data,
because they need to use the samples in the training set to
synthesize a new one.

In contrast, our method tries to directly solve the
generation problem with an end-to-end model called a fully
convolutional network (FCN), which can be regarded as a special
kind of convolutional neural network (CNN). More
precisely, an FCN is a stack of convolutional layers only. It can
solve complex nonlinear problems while producing pixel-wise
outputs, which makes it very suitable for the photo-sketch
generation problem. Please refer to Figure 2 for an intuitive
idea.

The main contributions of this work are threefold. First,
an end-to-end photo-sketch generation model is studied with
a novel architecture of fully convolutional networks, which is
original to the best of our knowledge. Second, we present a
joint generative-discriminative formulation driving the
network optimization, such that the generated face sketches
capture the details of the person as well as the discriminability
against other individuals. Third, our system demonstrates
superior performance compared with other state-of-the-art
approaches on several benchmarks.

The rest of the paper is organized as follows. Section 2
discusses the related work in photo-sketch synthesis and
convolutional neural networks. In Section 3, we describe
our model and its implementation. In Section 4, extensive
experiments suggest that our approach outperforms other
state-of-the-art methods on several benchmarks. In Section
5, we draw the conclusion of our work and list some ideas
we may adopt in the future.


2. RELATED WORK

2.1 Photo-Sketch Synthesis and Recognition 


Over the last ten years, there have been several works on the
topic of photo-sketch synthesis and recognition.

Tang and Wang [22] were the first to address the problem
with a database of significant size. They proposed a synthesis
method based on eigen transformation (ET). It relied
on the assumption that the photo-sketch mapping can
be approximated as linear, yet this assumption may be too
strong, especially when the hair region is included. It then
used a reconstruction-error-based distance for recognition.

Liu et al. m proposed a nonlinear face sketch synthesis
method called locally linear embedding (LLE), which can
be considered an improved version of [22]. Instead of
modelling the whole image, it applied ET on overlapping
local patches. It adopted a kernel-based nonlinear LDA
discriminative classifier for sketch recognition.

Another local-based strategy, proposed by Xiao et al. [26],
was based on the embedded hidden Markov model (E-HMM).
They transformed the sketches to pseudo-photos and applied
the eigenface algorithm for recognition.

In [24], Wang and Tang proposed a multiscale Markov
Random Field (MRF) model for face photo-sketch synthesis, which
can be applied to both photo-to-sketch synthesis and
sketch-to-photo synthesis. They evaluated a series of classifiers on
the pseudo-images. Random Sample LDA (RS-LDA) performed
best among them. Zhang et al. [28] then improved
the MRF framework by adding shape priors and descriptors
robust to lighting variations.

Most of the methods above may lose some vital details, which
affects the visual quality and the face recognition
performance. Thus, Zhang et al. m added a refinement step to
existing approaches. They applied a support vector regression
(SVR) based model to synthesize the high-frequency
information. Similarly, Gao et al. [6] proposed a new method
called SNS-SRE with two steps, i.e. sparse neighbor selection
(SNS) to get an initial estimation and
sparse-representation-based enhancement (SRE) for further improvement.

2.2 Pixel-Wise Predictions via CNNs

Convolutional neural networks (CNNs) are widely used in
computer vision. A typical structure contains a series of
convolutional and pooling layers as feature extractors, and
several fully connected layers as classifiers to give predictions.
CNNs have achieved great success in large-scale object
classification, localization and detection nni 0 ei si ed 0.

One important application of CNNs is to produce dense
or even pixel-wise predictions. Sermanet et al. El developed
a CNN-based framework called OverFeat, which
integrated recognition, localization and detection. The key
component was a pooling layer with offsets, which can imitate
the sliding window technique and produce dense outputs in
the final layer. A similar idea called spatial pyramid pooling
(SPP) was proposed by He et al. [8], which also modified
the last pooling layer. Wang et al. [25] proposed a joint
architecture for generic object extraction, and Luo et al. m
applied a deep decompositional network for pedestrian parsing.
Both of them shared the idea of adopting a fully
connected layer on top of the network to produce a dense
prediction. Moreover, the patch-by-patch scanning technique on
the original image has been studied by Ciresan et al. I] for
neuronal membrane segmentation and Farabet et al. [5] for
scene labeling.

Recently, Dong et al. [3] applied a CNN to image
super-resolution, where it produces a pixel-wise output. They
discussed its relationship to sparse-coding-based methods,
and concluded that three convolutional layers can
simulate the representation-mapping-reconstruction
procedure in sparse coding.

Our work was mostly inspired by [3]. We conducted our
early experiments with their network and then further
improved it. We call it a fully convolutional network (FCN;
a similar idea was also proposed in nn), for it contains
only convolutional layers and the corresponding activation
functions, without any other layers such as pooling, fully
connected, or local response normalization (LRN) layers. We
discuss it further in Section 3.2.


3. PHOTO-SKETCH GENERATION 
3.1 Formulation 

In our model, we use two constraints for sketch generation.
The first is that the generated sketches should be as
close as possible to the ones drawn by artists. This encourages
the deep network to learn the sketching skills of the
artists. The second is that the generated sketches should
be able to facilitate law enforcement. That is, given a
sketch drawn by the artist, it should be possible to identify the
subject in the photo database. These two constraints are
introduced as the generative loss and the discriminative regularizer
in our objective function, which is similar to [2].

Suppose there are $N$ subjects in the training set, with
each subject having one photo $P_i$ and one sketch $S_i$.
We use $f(W, P_i)$ to denote the sketch generated by a fully
convolutional network parameterized by $W$. We define our
loss function as follows, where $L_{gen}$ and $L_{discrim}$ represent
the generative loss and the discriminative regularizer
respectively. Here we use $\alpha$ to control the weight of the
discriminative regularizer in the overall objective.

$$L(P, S, W) = L_{gen}(P, S, W) + \alpha L_{discrim}(P, S, W) \qquad (1)$$

For the generative loss, we use a straightforward function,
defined as the pixel-wise difference between the ground-truth
sketch and the generated one.

$$L_{gen}(P, S, W) = \frac{1}{N} \sum_{i=1}^{N} \left(S_i - f(W, P_i)\right)^2 \qquad (2)$$

For the discriminative regularizer, we encourage the drawn
sketch of one particular person to be different from the
generated sketch of another person, as defined by the following
function.

$$L_{discrim}(P, S, W) = \frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{j=1, j \neq i}^{N} \log\left(1 + e^{-\frac{(S_i - f(W, P_j))^2}{\lambda}}\right) \qquad (3)$$

$\lambda$ is a parameter used to avoid numeric overflow.
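As a rough illustration of the objective in Equations (1)-(3), the sketch below evaluates the two terms on already generated outputs. It is only a minimal sketch under stated assumptions: images are flattened to plain lists of pixel values, the squared difference is summed over pixels, the exponent of the regularizer is taken to be negative (so that the term shrinks as two different persons' sketches move apart), and the function names are hypothetical.

```python
import math


def sq_dist(a, b):
    """Sum of squared pixel differences between two flattened images."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def generative_loss(sketches, outputs):
    """Equation (2): average squared distance between S_i and f(W, P_i)."""
    n = len(sketches)
    return sum(sq_dist(s, o) for s, o in zip(sketches, outputs)) / n


def discriminative_regularizer(sketches, outputs, lam):
    """Equation (3), assuming a negative exponent: large when the drawn
    sketch S_i is close to the generated sketch of another person j."""
    n = len(sketches)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += math.log1p(math.exp(-sq_dist(sketches[i], outputs[j]) / lam))
    return total / (n * (n - 1))


def joint_loss(sketches, outputs, alpha, lam):
    """Equation (1): generative loss plus weighted regularizer."""
    return (generative_loss(sketches, outputs)
            + alpha * discriminative_regularizer(sketches, outputs, lam))
```

Here `outputs[i]` stands for $f(W, P_i)$; in practice it would come from the forward pass of the network rather than being supplied directly.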

Parameter Optimization Using this model, the learning
process is to minimize the loss function $L(P, S, W)$. This
is achieved by the standard network back-propagation algorithm
in the batch training manner. More precisely, in each
iteration, we randomly select a set of photo-sketch pairs and
construct the generative loss items and the discriminative
regularizer items respectively. In order to derive the gradient
with respect to the network parameter $W$, the key step is
to calculate the partial derivative of the loss function with
respect to the output (generated sketch) of each photo. As
one photo may be involved in several cost items, we need
to go through each item to accumulate the derivative with
respect to the output. Algorithm 1 gives the details of the
learning process.
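The accumulation of the derivative with respect to each output can be sketched as follows. This is an illustration rather than the authors' implementation: it assumes flattened images, the squared difference summed over pixels, and a negative exponent in the regularizer; `output_gradients` is a hypothetical name. For output $O_k = f(W, P_k)$, the generative item contributes $-\frac{2}{N}(S_k - O_k)$, and each pair $(i, k)$ with $i \neq k$ contributes $\frac{\alpha}{N(N-1)} \cdot \frac{2}{\lambda} \cdot \frac{e^{-d/\lambda}}{1 + e^{-d/\lambda}} (S_i - O_k)$ with $d = (S_i - O_k)^2$.

```python
import math


def output_gradients(sketches, outputs, alpha, lam):
    """Return dL/dO_k for every output O_k, accumulating over all
    cost items in which O_k takes part."""
    n = len(sketches)
    grads = [[0.0] * len(o) for o in outputs]
    for k in range(n):
        # Generative item: derivative of (1/N) * (S_k - O_k)^2.
        for p in range(len(outputs[k])):
            grads[k][p] += -2.0 / n * (sketches[k][p] - outputs[k][p])
        # Discriminative items: every pair (i, k) with i != k involves O_k.
        for i in range(n):
            if i == k:
                continue
            d = sum((s - o) ** 2 for s, o in zip(sketches[i], outputs[k]))
            e = math.exp(-d / lam)          # d >= 0, so this cannot overflow
            weight = e / (1.0 + e)
            scale = alpha / (n * (n - 1)) * 2.0 / lam * weight
            for p in range(len(outputs[k])):
                grads[k][p] += scale * (sketches[i][p] - outputs[k][p])
    return grads
```

These per-output derivatives are what would be fed into backward propagation to obtain the gradient with respect to $W$.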


3.2 Fully Convolutional Representation 

In this subsection, we discuss the properties of
convolutional layers and how they can be cascaded into a pixel-wise
fully convolutional representation of an input image.

Composition of Convolutions A typical convolutional
layer with $K$ kernels and activation function $f$ can be
formulated as

$$y^k_{ij} = f\left((W_k * x)_{ij} + b_k\right) \qquad (4)$$

where $x$ denotes the input feature maps, $k \in \{1, 2, ..., K\}$
denotes the index of a filter, $W_k$ and $b_k$ are the $k$-th filter
weights and bias, and $y^k_{ij}$ denotes the element at the coordinate
$(i, j)$ on the $k$-th output feature map.
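To make Equation (4) concrete, the following is a minimal pure-Python sketch of one unpadded, stride-1 convolutional layer. The data layout (lists of feature maps), the use of a rectified linear unit as $f$, and the name `conv_layer` are assumptions made for illustration; as in most CNN frameworks, the kernel is applied without flipping.

```python
def relu(v):
    """Rectified linear unit."""
    return v if v > 0.0 else 0.0


def conv_layer(x, kernels, biases, activation=relu):
    """x: list of C input maps (H x W).
    kernels: K filters, each indexed as [channel][row][col].
    Returns K output maps of size (H - kh + 1) x (W - kw + 1)."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    out = []
    for w_k, b_k in zip(kernels, biases):
        kh, kw = len(w_k[0]), len(w_k[0][0])
        fmap = []
        for i in range(h - kh + 1):
            row = []
            for j in range(w - kw + 1):
                acc = b_k
                for ch in range(c):
                    for u in range(kh):
                        for v in range(kw):
                            acc += w_k[ch][u][v] * x[ch][i + u][j + v]
                row.append(activation(acc))
            fmap.append(row)
        out.append(fmap)
    return out
```

Each output element depends only on the input window under the kernel, which is why the spatial layout of the input carries over to the output.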

Equation 4 indicates that the convolutional operation
preserves the spatial relationship. Moreover, composition
of convolutions does not change this property, i.e. we
can use a stack of convolutional layers to represent a complex
nonlinear mapping, which can be applied to an input
of arbitrary size and then produces a corresponding spatial
output.

Algorithm 1 Parameter Optimization

Require:
    Training photos $\{P_i\}$ and sketches $\{S_i\}$;
Ensure:
    Network parameters $W$
 1: while $t < T$ do
 2:     $t \leftarrow t + 1$;
 3:     Randomly select a subset of photos and sketches $\{P'_i\}$, $\{S'_i\}$ from the training set;
 4:     for all $P'_i$ do
 5:         Do forward propagation to get $f(W, P'_i)$
 6:     end for
 7:     $\Delta W = 0$
 8:     for all $P'_i$ do
 9:         Calculate the partial derivative with respect to the output: $\frac{\partial L}{\partial f(W, P'_i)}$
10:         Run backward propagation to obtain the gradient with respect to the network parameter: $\Delta W_i$
11:         Accumulate the gradient: $\Delta W \mathrel{+}= \Delta W_i$
12:     end for
13:     $W^t = W^{t-1} - \lambda_t \Delta W$
14: end while

Relationship to the Patch-Wise Representation In
Equation 4, a pixel on the output feature map is
influenced only by a patch on the input feature map, i.e. its
receptive field. This property is also maintained under
composition of convolutions. Thus, the fully convolutional
representation of the whole image can be regarded as a batch
of patch-wise representations on patches of the image.
Nevertheless, the fully convolutional representation is much more
efficient, for the patches overlap significantly (we set the stride
to 1 in all layers).
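The size of the patch seen by one output pixel, and the number of overlapping patches covered in a single pass, follow from simple arithmetic. The sketch below assumes stride 1, no padding, and the kernel sizes given in Section 3.3 (two (5 x 5), two (1 x 1) and two (3 x 3) layers, in that order); the function names are hypothetical.

```python
def receptive_field(kernel_sizes):
    """Side length of the input patch that influences one output pixel
    for a stack of stride-1 convolutions."""
    return 1 + sum(k - 1 for k in kernel_sizes)


def num_patches(width, height, kernel_sizes):
    """Number of overlapping patches evaluated in one unpadded pass
    over a (width x height) image."""
    r = receptive_field(kernel_sizes)
    return (width - r + 1) * (height - r + 1)


medium = [5, 5, 1, 1, 3, 3]  # assumed layer order of the six-layer network
print(receptive_field(medium))        # 13
print(num_patches(155, 200, medium))  # 26884
```

Under these assumptions, one forward pass over a (155 x 200) photo corresponds to 26884 overlapping (13 x 13) patches, whose shared computation is what a patch-by-patch scan would repeat.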

Border Effect Trade-Off Convolutional operations
shrink the size of the feature map. For example, if we apply
a (3 x 3) convolution operator to a (3 x 3) input, the output
size is shrunk to (1 x 1).

As more convolutional layers are stacked up, more shrinkage
accumulates in the outputs. A possible solution is to add
proper padding before the convolution operation, but this
brings border effects, as noted in [3].

Thus, as a trade-off, we do not use padding in the
convolutional layers during either training or testing.
Each (155 x 200) photo image is shrunk to (143 x 188)
in the representation (see Figure 2).
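The shrinkage can be checked layer by layer. The short sketch below assumes stride-1, unpadded convolutions and the kernel sizes listed in Section 3.3; `valid_output_sizes` is a hypothetical helper name.

```python
def valid_output_sizes(width, height, kernel_sizes):
    """Feature-map size after each stride-1, unpadded convolution,
    starting with the input size itself."""
    sizes = [(width, height)]
    for k in kernel_sizes:
        width, height = width - k + 1, height - k + 1
        sizes.append((width, height))
    return sizes


print(valid_output_sizes(3, 3, [3])[-1])                     # (1, 1)
print(valid_output_sizes(155, 200, [5, 5, 1, 1, 3, 3])[-1])  # (143, 188)
```

Each (k x k) layer removes k - 1 pixels per dimension, so the six layers remove 4 + 4 + 0 + 0 + 2 + 2 = 12 pixels in total.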

3.3 Implementation 

Pre-processing As described in [24], all the photos
and sketches are translated, rotated, and scaled such that
the two eye centers of all the face images are at fixed
positions in the pre-processing step. This simple geometric
normalization step makes the same face components in different
images roughly aligned.

Another pre-processing step is inspired by the results*
of [24]. The transformation mentioned above produces a
(200 x 250) image, but it may have some black regions in
the border areas. We crop the (155 x 200) center part of the
image in order to exclude this negative influence.

* http://www.ee.cuhk.edu.hk/~xgwang/sketch_multiscale.html













Figure 2: The overview of our model. It takes a full-size photo image as input and directly generates a 
full-size pseudo-sketch as output. The middle part is the architecture of our fully convolutional network. It 
contains six convolutional layers, with rectified linear units as activation functions (omitted in the figure). 


Since we choose to avoid border effects, the sketch images 
need to be cropped to (143 x 188) to fit in the network output 
dimension. 

Spatial Patch-wise Learning with Overlapping According to
the previous works on photo-sketch synthesis [15], [24],
patch-wise learning with overlapping is very important for handling
the non-linearity between photos and sketches. Intuitively,
patches at different positions differ from one another (e.g.
eye patches, nose patches and mouth patches). Therefore,
learning different patch representations at different spatial
positions is a very straightforward idea.

We handle this by adding additional XY channels to the
input, i.e. the input image data contain five channels: three
RGB channels of the photo image, and two channels of the
corresponding coordinates. This small modification
significantly improves the results, as discussed in
Section 4.3.
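The five-channel input can be assembled as in the sketch below. The text does not state how the coordinates are encoded, so raw pixel indices are used here purely as an assumption, and `add_xy_channels` is a hypothetical name.

```python
def add_xy_channels(rgb):
    """rgb: list of three (H x W) maps [R, G, B].
    Returns five maps [R, G, B, X, Y], where X holds the column index
    and Y the row index of every pixel (assumed encoding)."""
    h, w = len(rgb[0]), len(rgb[0][0])
    x_chan = [[float(col) for col in range(w)] for _ in range(h)]
    y_chan = [[float(row) for _ in range(w)] for row in range(h)]
    return list(rgb) + [x_chan, y_chan]
```

With the coordinates available at every pixel, the same filters can respond differently to, for example, an eye patch and a mouth patch that happen to look alike.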

Network Architecture We apply the network proposed
in [3] in our early experiments, and then modify its
architecture for further enhancement. We borrow the idea in m,
which had great success in ILSVRC-2014. Their main
contribution is to deepen the CNN under computational
constraints via very small (3 x 3) convolutional filters in all
layers.

Similarly, we modify the 3-layer network in [3] into a
6-layer network, which is shown in Figure 2. For the (9 x
9), (1 x 1) and (5 x 5) convolutional layers in [3], we replace
them with two (5 x 5), two (1 x 1) and two (3 x 3) convolutional
layers respectively. Moreover, we double the number of filters
in every layer to improve the network's capacity.

We call it medium network due to its size. We have two 
similar architectures called small network and large network, 
which will be further discussed in Section 4.3. 

Training Details Our results are based on our medium
network (see Figure 2). As mentioned above, we first
pre-process the photo images into (155 x 200) and the sketch
images into (143 x 188). The optimization objective has
been discussed in Section 3.1. We set $\lambda$ to $10^9$ and $\alpha$ to
$10^4$.

We use Caffe [9] for implementation. The filter weights
are initialized by drawing randomly from a Gaussian
distribution with zero mean and standard deviation 0.01, and
the biases are initialized to zero. We set the learning rate to
$10^{-11}$; training then takes several hours to converge on an NVIDIA
Tesla K40 GPU.


4. EXPERIMENTS 

We evaluate our model on the CUHK student dataset [24].
It includes 188 faces. 88 faces are selected for training and
the remaining 100 faces are selected for testing. For each face,
there is a sketch drawn by the artist and a photo taken in
a frontal pose, under normal lighting conditions, and with a
neutral expression.

We conduct three groups of experiments, which are
presented in the next three subsections. Section 4.1
discusses our photo-sketch generation results and compares them
with the synthesis-based method in [24]. Section 4.2
shows that our generated pseudo-sketches significantly reduce
the modality gap between photos and sketches, so they can be
adopted in a sketch-based face verification system. Section
4.3 presents an empirical study of our model. Several
factors such as network depth and filter numbers are
discussed under the evaluation protocol both qualitatively and
quantitatively.

4.1 Face-Sketch Generation 

In Figure 3 we show some examples of our generated
results and compare them to [24]. For fair comparison, we list
all 14 pseudo-sketches which can be found on their website.

Figure 3 indicates that our results contain more vital
details. For example, the man in column (e) of the first row
has two little locks of hair on his forehead, but this detail
is missing in the pseudo-sketch of [24] (in column (g)). In
contrast, our generated result (in column (f)) retains this
detail, which may be very important for distinguishing
this man from others. Another example is the woman
in column (a) of the second row. She has a distinctive
feature compared with other persons in the database, namely
two big eyes. Our method also captures this detail in the
pseudo-sketch (in column (b)).

This suggests that synthesis-based methods may fail in some
cases. The main reason is that they rely on the assumption
that the synthesized sketch image (or patch) can be
reconstructed from the sketch images (or patches) in the training
set. However, this assumption may be too strong, since some
people do have distinctive facial features.

Our generative method overcomes this difficulty. We try
to directly learn the transformation from photo space to
sketch space. Even though it is a complex nonlinear
mapping, the FCN has the capacity to handle this challenge.

Compared to the traditional synthesis-based methods, our
generative approach has two advantages. Firstly, it
can retain more detailed information from the photos.
Secondly, our inference time is independent of the amount of
training data. The time cost of synthesis-based methods
grows linearly with the amount of data, while the runtime of
our approach is affected only by the size of the input photo.

Figure 3: Compared to synthesis based method m , our generative approach could retain more details in
the original photos, e.g. the two locks of hair on the forehead of the man in column (e) of row 1; the two big
eyes of the woman in column (a) of row 2. (a)&(e) photos; (b)&(f) our generative pseudo-sketches; (c)&(g)
synthesized pseudo-sketches in [23]; (d)&(h) sketches drawn by the artist.

4.2 Sketch-Based Face Verification 

In this subsection, we show that our generated
pseudo-sketches significantly reduce the modality gap between photos and
sketches. We follow the same testing procedure as in mini
124) . which can be summarized in two steps: (a) convert the
photos in the testing set into corresponding pseudo-sketches; (b)
define a feature or transformation to measure the distance
between the query sketch and the pseudo-sketches.

In our implementation, we use our model to generate the
pseudo-sketches in step (a), and use our generative loss
(see Equation 2) as the distance measurement. Since the
size of our pseudo-sketches is (188 x 143), we also crop
the query sketch to (188 x 143).

Following the same protocol described in [22], we compare
our approach with previous methods on the cumulative match
score (CMS) in Table 1. CMS measures the percentage
of queries for which “the correct answer is in the top n matches”,
where n is called the rank.
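The CMS at a given rank can be computed from a matrix of distances between query sketches and gallery pseudo-sketches, as in the minimal sketch below. The convention that query `q` matches gallery entry `q`, and the function name, are assumptions made for illustration.

```python
def cumulative_match_score(distances, rank):
    """distances[q][g]: distance between query sketch q and gallery
    pseudo-sketch g; the correct match of query q is assumed to be
    gallery entry q. Returns the percentage of queries whose correct
    match is among the `rank` closest gallery entries."""
    hits = 0
    for q, row in enumerate(distances):
        order = sorted(range(len(row)), key=lambda g: row[g])
        if q in order[:rank]:
            hits += 1
    return 100.0 * hits / len(distances)
```

In the setting described above, `distances[q][g]` would be the generative loss between the cropped query sketch and the pseudo-sketch generated from gallery photo `g`.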

As shown in Table 1, we form a baseline experiment,
which converts the photo into gray scale as a naive
“pseudo-sketch”, to clarify the modality difference between
photo space and sketch space. In this baseline, the rank-1
accuracy is only 41%, which is far from
satisfactory. An early method, ET [22],
achieves 71% for the top-one match. Later methods in [24], [28],
m have greatly improved the top-one verification accuracy
to 96%, 99% and even 100%.

As the last row in Table 1 shows, our approach also gets 100%
correct answers on its first guess. It is worth noting that
our approach is quite different from the former methods, for
it applies generation instead of synthesis.



            Rank 1   Rank 3   Rank 5   Rank 10
Baseline    41       56       59       70
ET [22]     71       81       88       96
MRF [24]    96       -        -        100
MRF+ [23]   99       -        -        100
SVR EH      100      -        -        -
Ours        100      100      100      100

Table 1: Cumulative Match Scores (CMS) comparison
on the full training set (88 photo-sketch pairs). Our
method achieves 100% accuracy on the first guess.

Moreover, our generative method is very robust and has
excellent generalization ability. Figure 4 suggests that
our method can reach 100% rank-1 accuracy using only a
portion of the training data. For example, we need only
5 training samples to reach 95% accuracy, and 27 training
samples are enough to get a 100% score for the top-one candidate.

We also verify our optimization objective
in Figure 4. It suggests that the model trained with
the discriminative regularizer consistently outperforms the
model trained without the discriminative regularizer.

4.3 Empirical Study 



Figure 4: The model trained with the discriminative
regularizer consistently outperforms the model
trained without the discriminative regularizer. Both
models reach 100% accuracy on the rank-1 candidate while
using fewer than half of the training samples, i.e. 27
samples out of 88 photo-sketch pairs.


In this subsection, we further analyze
the factors which affect our photo-sketch generation
performance.

Different Network Architectures Our first concern is
the depth of the network and the number of filters.
We use the network proposed in [3] (we call it the SR network)
in our early experiments. However, the result is not quite
satisfactory, for this network seems too small to learn the
complex nonlinear mapping from photos to sketches. Thus, we
improve the network by: (a) adding more layers to the
network; and (b) adding more filters to each layer.

We conduct our experiments on three different network
architectures. Due to their scales, we call them the small, medium
and large networks respectively.

The medium network architecture can be found in Figure
2. For the differences between it and the SR network, please refer
to Section 3.3. Moreover, the only difference between our
small, medium and large networks is their filter numbers. The
medium network has twice as many filters as the
small one, and the large network doubles the filter numbers
of the medium network.

For comparison, we define a measurement called MPRL,
which is short for multiscale pixel-wise reconstruction loss.
It evaluates the pixel-wise accuracy at different scales. To
explain MPRL, we first introduce the pixel-wise
reconstruction loss (PRL) on one photo-sketch pair.

For a photo-sketch pair, we denote the sketch as the ground
truth (GT) and the generated pseudo-sketch as the prediction
(P). Both of their sizes are $(W \times H)$. Thus, PRL can be
formulated as

$$PRL = \sqrt{\frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left(GT(x, y) - P(x, y)\right)^2} \qquad (5)$$

MPRL is the multiscale version of PRL. In practice, for
each photo-sketch pair, we rescale both the GT and the P
to three scales {0.5, 1, 2} and evaluate their PRL to form
the MPRL. Then, for the whole test set, we evaluate all the
pairs and average their MPRL as the final measurement.

[Figure 5 plot: Multiscale Pixel-wise Reconstruction Loss at Scale=0.5, Scale=1 and Scale=2 for (a) SR Net, (b) Small Net, (c) Medium Net, (d) Large Net]

Figure 5: Comparison of the MPRL measurement
and generated pseudo-sketches for different network
architectures. The model achieves better performance
when the network is made deeper or more filters
are added. (a-d) Pseudo-sketches generated via the
network in [3], our small network, our medium network
and our large network.
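The PRL and MPRL measurements can be sketched as follows. This is only an illustration under assumptions: PRL is taken to be the root of the mean squared pixel difference, and, since the text does not name the interpolation used for rescaling, a simple nearest-neighbour resize is substituted here. All function names are hypothetical.

```python
import math


def prl(gt, pred):
    """Root mean squared pixel difference between two (H x W) images."""
    h, w = len(gt), len(gt[0])
    total = sum((gt[y][x] - pred[y][x]) ** 2 for y in range(h) for x in range(w))
    return math.sqrt(total / (w * h))


def rescale_nearest(img, scale):
    """Nearest-neighbour resize (an assumed choice of interpolation)."""
    h, w = len(img), len(img[0])
    nh, nw = max(1, int(round(h * scale))), max(1, int(round(w * scale)))
    return [[img[min(h - 1, int(y / scale))][min(w - 1, int(x / scale))]
             for x in range(nw)] for y in range(nh)]


def mprl(gt, pred, scales=(0.5, 1, 2)):
    """PRL of one pair evaluated at each scale."""
    return [prl(rescale_nearest(gt, s), rescale_nearest(pred, s)) for s in scales]


def mean_mprl(pairs, scales=(0.5, 1, 2)):
    """Average the per-pair MPRL over a test set of (gt, pred) pairs."""
    per_pair = [mprl(gt, pred, scales) for gt, pred in pairs]
    return [sum(col) / len(per_pair) for col in zip(*per_pair)]
```

The result is one value per scale, matching the three-number form in which the MPRL of each network is reported.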

Figure 5 summarizes the results both qualitatively and
quantitatively. The MPRL of the SR network is {34.5, 36.8,
36.2}. Our small network reduces it to {32.6, 35.0, 34.4}.
Our medium and large networks then perform better still, with
{32.1, 34.6, 34.0} and {30.0, 32.5, 31.9} respectively.

The pseudo-sketch examples generated by each network
are shown in the bottom part of Figure 5. It suggests that the
results of the SR network capture the overall structures, but
they are far from satisfactory in the details. Our small
network's results are considerably better and have clearer
contours. Moreover, the pseudo-sketches generated by the medium
network are more vivid than those of the small one. The results of
the large network are as good as those of the medium one, but the large
network is much more time consuming.

Table 2 shows that our small network is almost as fast
as the SR network. The medium network's cost is about 2
times that of the small one, and the large network's runtime is
about three times that of the medium one.

To balance the effectiveness and efficiency, we prefer our 
medium network as the final choice. 



        SR Net   Small Net   Medium Net   Large Net
Time    8.5ms    8.6ms       17.1ms       50.8ms

Table 2: Runtime for a single (155 x 200) image on an
NVIDIA Tesla K40 GPU


Difference between training with and without XY
channels Now we consider another factor that influences
the model's performance. As mentioned above, we add XY
channels to the RGB color channels. We assume that this
additional spatial information helps the network to distinguish
different spatial patches and learn different representations
for them.

Similarly, we use MPRL for quantitative measurement
and pseudo-sketch comparison for qualitative analysis.
In Figure 6, the left part shows results generated by our
medium network trained without XY channels, and the
right part shows results generated by our medium network
trained normally. Figure 6 suggests that the latter
outperforms the former both numerically and perceptually.


[Figure 6 plot: Multiscale Pixel-wise Reconstruction Loss at Scale=0.5, Scale=1 and Scale=2 for (a) w/o XY channels, (b) w/ XY channels]

Figure 6: Comparison of the MPRL measurement
and generated pseudo-sketches for networks trained
without and with XY channels. Adding XY channels
can make the model more robust. (a) Pseudo-sketches
generated via the network trained without XY
channels; (b) pseudo-sketches generated via the network
trained with XY channels.


5. CONCLUSION AND FUTURE WORK 

In this paper, we propose an end-to-end fully convolutional
network in order to directly model the complex nonlinear
mapping between face photos and sketches. From the
experiments, we find that the fully convolutional network
is a powerful tool which can handle this difficult problem
while providing a pixel-wise prediction both effectively and
efficiently.

However, this solution still has some limitations. Firstly,
the synthesized pseudo-sketches in m have sharper edges
and clearer contours than our generated ones. Secondly, the
database involved in our experiments contains only Asian
face images, which may limit the generalization ability of
our model to other ethnic groups.












In future work, we will further improve our loss function
and try various databases in our experiments, and we may
explore the relation between our work and work
involving non-photorealistic rendering HU.